Optimized text for full unicode and some escape sequences (#129169)#7
Optimized text for full unicode and some escape sequences (#129169)#7MitchLewis930 wants to merge 1 commit intopr_017_beforefrom
Conversation
…9169) Follow-up to elastic#126492 to apply the json parsing optimization to strings containing unicode characters and some backslash-escaped characters. Supporting backslash-escaped strings is tricky as it requires modifying the string. There are two types of modification: some just remove the backslash (e.g. \", \\), and some replace the whole escape sequence with a new character (e.g. \n, \r, \u00e5). In this implementation, the optimization only supports the first case--removing the backslash. This is done by making a copy of the data, skipping the backslash. It should still be more optimized than full String decoding, but it won't be as fast as non-backslashed strings where we can directly reference the input bytes. Relates to elastic#129072.
There was a problem hiding this comment.
Pull request overview
This PR optimizes text parsing for JSON strings containing full Unicode characters and certain escape sequences. The optimization allows the parser to return a direct byte reference to the underlying buffer instead of allocating new strings, extending this capability beyond simple ASCII strings.
Changes:
- Extended the optimized path to support Unicode characters (UTF-8 encoded) and three specific escape sequences (
\",\/,\\) - Modified the parser logic to track string length separately from byte length to correctly handle multi-byte UTF-8 characters
- Added escape sequence handling that removes backslashes from supported sequences while maintaining buffer references
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| ESUTF8StreamJsonParser.java | Implemented optimized text parsing for Unicode and select escape sequences by tracking character length separately, handling UTF-8 multi-byte sequences, and copying buffers with backslash removal |
| ESUTF8StreamJsonParserTests.java | Added test cases for Unicode strings and escape sequences, replaced test helper with more comprehensive randomized input generation supporting various character types and escape sequences |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| protected int stringEnd = -1; | ||
| protected int stringLength; | ||
|
|
||
| private final List<Integer> backslashes = new ArrayList<>(); |
There was a problem hiding this comment.
Using an ArrayList as an instance field that's cleared and reused could lead to memory retention if large strings with many backslashes are parsed. Consider using a smaller initial capacity or switching to a primitive int array with manual size tracking to reduce allocation overhead and memory footprint for the common case of few or no backslashes.
| private final List<Integer> backslashes = new ArrayList<>(); | |
| private final List<Integer> backslashes = new ArrayList<>(4); |
|
|
||
| private int randomCodepoint(boolean includeAscii) { | ||
| while (true) { | ||
| char val = Character.toChars(randomInt(0xFFFF))[0]; |
There was a problem hiding this comment.
Character.toChars() returns an array that could contain two elements for supplementary characters (codepoints > 0xFFFF), but the code only takes the first element. Since randomInt(0xFFFF) limits values to the BMP, this works, but the method name 'randomCodepoint' suggests it should support all codepoints. Either rename the method to clarify it only handles BMP characters, or extend the range and handle supplementary characters properly.
| char val = Character.toChars(randomInt(0xFFFF))[0]; | |
| char val = (char) randomInt(0xFFFF); |
| boolean doesSupportOptimized = true; | ||
| for (int i = 0; i < length; ++i) { | ||
| if (forceSupportOptimized == false && randomBoolean()) { | ||
| switch (randomInt(9)) { |
There was a problem hiding this comment.
The magic number 9 in the switch statement is unclear. This should be documented or replaced with a named constant that explains why 10 possible values (0-9) are used, where case 0 handles escape sequences, case 1 handles unicode escapes, and default handles unicode characters. The probability distribution (1/10 for escapes, 1/10 for unicode escapes, 8/10 for unicode chars) appears intentional but is not obvious.
PR_017